Abstract. Macro-economic models describe the dynamics of economic
quantities. The estimations and forecasts produced by such models play a 
substantial role for financial and political decisions. In this contribution 
we describe an approach based on genetic programming and symbolic 
regression to identify variable interactions in large datasets. In the pro- 
posed approach multiple symbolic regression runs are executed for each 
variable of the dataset to find potentially interesting models. The re- 
sult is a variable interaction network that describes which variables are 
most relevant for the approximation of each variable of the dataset. This 
approach is applied to a macro-economic dataset with monthly observa- 
tions of important economic indicators in order to identify potentially 
interesting dependencies of these indicators. The resulting interaction 
network of macro-economic indicators is briefly discussed and two of the 
identified models are presented in detail. The two models approximate 
the help wanted index and the CPI inflation in the US. 
1 Motivation

Macro-economic models describe the dynamics of economic quantities of coun- 
tries or regions, as well as their interaction on international markets. Macro- 
economic variables that play a role in such models are for instance the unem- 
ployment rate, gross domestic product, current account figures and monetary 
aggregates. Macro-economic models can be used to estimate the current eco- 
nomic conditions and to forecast economic developments and trends. Therefore 
macro-economic models play a substantial role in financial and political deci- 
sions. 

It has been shown by Koza that genetic programming can be used for econo-
metric modeling [B], [?]. He used a symbolic regression approach to rediscover
the well-known exchange equation relating money supply, price level, gross na- 
tional product and velocity of money in an economy, from observations of these 
variables. 

Genetic programming is an evolutionary method imitating aspects of biolog- 
ical evolution to find a computer program that solves a given problem through 
gradual evolutionary changes starting from an initial population of random pro- 
grams [7] . Symbolic regression is the application of genetic programming to find 
regression models represented as symbolic mathematical expressions. Symbolic 
regression is especially effective if little or no information is available about the 
studied system or process, because genetic programming is capable of evolving
the necessary structure of the model in combination with the parameters of the
model.

In this contribution we take up the idea of using symbolic regression to gen- 
erate models describing macro-economic interactions based on observations of 
economic quantities. However, contrary to the constrained situation studied in 
[B], we use a more extensive dataset with observations of many different economic
quantities, and aim to identify all potentially interesting economic interactions 
that can be derived from the observations in the dataset. In particular, we de- 
scribe an approach using GP and symbolic regression to generate a high level 
overview of variable interactions that can be visualized as a graph. 

Our approach is based on a large collection of diverse symbolic regression 
models for each variable of the dataset. In the symbolic regression runs the most 
relevant input variables to approximate each target variable are determined. This 
information is aggregated over all runs and condensed to a graph of variable
interactions, providing a coarse-grained, high-level overview of these interactions.

We have applied this approach to a dataset with monthly observations of
economic quantities to identify (non-linear) interactions of macro-economic
variables.

2 Modeling Approach 

The main objective discussed in this contribution is the identification of all po- 
tentially interesting models describing variable relations in a dataset. This is a 
broader aim than usually followed in a regression approach. Typically, modeling 
concentrates on a specific variable of interest (target variable) for which an ap- 
proximation model is sought. Our aim resembles the aim of data mining, where 
the variable of interest is often not known a-priori and instead all quantities are 
analyzed in order to find potentially interesting patterns [3]. 

2.1 Comprehensive Symbolic Regression 

A straightforward way to find all potentially interesting models in a dataset is
to execute independent symbolic regression runs for all variables of the dataset 
building a large collection of symbolic regression models. This approach of com- 
prehensive symbolic regression over the whole dataset is also followed in this 
contribution. 
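
The bookkeeping behind comprehensive symbolic regression can be sketched in a few lines of Python. This is only an illustration of how the runs are enumerated, not the setup used in the experiments: the function name `plan_runs`, the job dictionary layout and the use of the repetition index as a seed are assumptions made for this sketch.

```python
from itertools import product


def plan_runs(variables, runs_per_target):
    """Enumerate one symbolic regression job per (target, repetition) pair.

    Every variable of the dataset is used as the target in turn, and all
    remaining variables are allowed as inputs (hypothetical job layout).
    """
    jobs = []
    for target, run in product(variables, range(runs_per_target)):
        inputs = [v for v in variables if v != target]
        jobs.append({"target": target, "inputs": inputs, "seed": run})
    return jobs


# Toy example with three variables and two repetitions per target.
jobs = plan_runs(["x", "y", "z"], runs_per_target=2)
```

Each job would then be handed to a symbolic regression run; the resulting models form the collection that is analyzed in the following sections.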



Especially in real world scenarios there are often dependencies between the 
observed variables. In symbolic regression the model structure is evolved freely, so 
any combination of input variables can be used to model a given target variable. 
Even if all input variables are independent, a given function can be expressed 
in multiple different ways which are all semantically identical. This fact makes 
the interpretation of symbolic regression models difficult as each run produces 
a structurally different result. If the input variables are not independent, for
instance if a variable x can be described by a combination of two other variables
y and z, this problem is exacerbated, because it is possible to express semantically
equivalent functions using differing sets of input variables. A benefit of the
comprehensive symbolic regression approach is that dependencies of all variables 
in the dataset are made explicit in the form of separate regression models. When
regression models for dependencies of input variables are known, it is possible 
to detect alternative representations. 

Collecting models from multiple symbolic regression runs is simple, but it 
is difficult to detect the actually interesting models [3]. We do not discuss in- 
terestingness measures in this contribution. Instead, we propose a hierarchical 
approach for the analysis of results of multiple symbolic regression runs. On a 
high level, only aggregated information about relevant input variables for each 
target variable is visualized in the form of a variable interaction network. If a specific
variable interaction seems interesting, the models which represent the interaction 
can be analyzed in detail. 

Information about relevant variable interactions is implicitly contained in the 
symbolic regression models and distributed over all models in the collection. In 
the next section we discuss variable relevance metrics for symbolic regression 
which can be used to determine the relevant input variables for the approxima- 
tion of a target variable. 



2.2 Variable Relevance Metrics for Symbolic Regression 



Information about the set of input variables necessary to describe a given
dependent variable is often valuable for domain experts. For linear regression
modeling, powerful methods have been described to detect the relevant input
variables through variable selection or shrinkage methods [?]. However, if
non-linear models are necessary then variable selection is more difficult. It has
been shown that genetic programming implicitly selects relevant variables [?] for
symbolic regression. Thus, symbolic regression can be used to determine relevant
input variables even in situations where non-linear models are necessary.

A number of different variable relevance metrics for symbolic regression have 
been proposed in the literature [12]. In this contribution a simple frequency-
based variable relevance metric is proposed that is based on the number of
variable references in all solution candidates visited in a GP run.



2.3 Frequency-based Variable Relevance Metric 

The function $\mathrm{relevance}_{\mathrm{freq}}(x_i)$ is an indicator for the relative relevance of
variable $x_i$. It is calculated as the average relative frequency of variable
references $\mathrm{freq}_{\%}(x_i, \mathrm{Pop}_g)$ in population $\mathrm{Pop}_g$ at generation $g$ over all
$G$ generations of one run,

\[
\mathrm{relevance}_{\mathrm{freq}}(x_i) = \frac{1}{G} \sum_{g=1}^{G} \mathrm{freq}_{\%}(x_i, \mathrm{Pop}_g). \tag{1}
\]
The relative frequency $\mathrm{freq}_{\%}(x_i, \mathrm{Pop})$ of variable $x_i$ in a population is the
number of references $\mathrm{freq}(x_i, \mathrm{Pop})$ of variable $x_i$ over the number of all
variable references,

\[
\mathrm{freq}_{\%}(x_i, \mathrm{Pop}) = \frac{\sum_{s \in \mathrm{Pop}} \mathrm{RefCount}(x_i, s)}{\sum_{k=1}^{n} \sum_{s \in \mathrm{Pop}} \mathrm{RefCount}(x_k, s)}, \tag{2}
\]

where the function $\mathrm{RefCount}(x_i, s)$ simply counts all references to variable
$x_i$ in model $s$, and $n$ is the number of variables.
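
A minimal Python sketch of Equations 1 and 2 is given below. It assumes, purely for illustration, that each model is available as a flat list of the variable names it references and that a run is a list of populations, one per generation; the function names are hypothetical and not taken from any particular GP framework.

```python
from collections import Counter


def relative_frequencies(population, variables):
    """Equation 2: share of all variable references in one population
    that point to each variable."""
    counts = Counter()
    for model in population:
        # A model is represented by the list of variable names it references.
        counts.update(v for v in model if v in variables)
    total = sum(counts.values())
    if total == 0:
        return {v: 0.0 for v in variables}
    return {v: counts[v] / total for v in variables}


def relevance_freq(generations, variables):
    """Equation 1: mean of the per-generation relative frequencies
    over all G generations of one run."""
    if not generations:
        raise ValueError("at least one generation is required")
    sums = {v: 0.0 for v in variables}
    for population in generations:
        freqs = relative_frequencies(population, variables)
        for v in variables:
            sums[v] += freqs[v]
    return {v: sums[v] / len(generations) for v in variables}


variables = ["x1", "x2", "x3"]
generations = [
    [["x1", "x2"], ["x1", "x1"]],  # generation 1: x1 holds 3 of 4 references
    [["x2", "x3"], ["x1", "x3"]],  # generation 2: x1 holds 1 of 4 references
]
relevance = relevance_freq(generations, variables)
```

In this toy run the relevance of x1 is the mean of 0.75 and 0.25, and the relevance values of all variables sum to one.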

The advantage of calculating the variable relevance for the whole run instead 
of using only the last generation is that the dynamic behavior of variable rele- 
vance over the whole run is taken into account. The relevance of variables typi- 
cally differs over multiple independent GP runs, because of the non-deterministic 
nature of the GP process. Therefore, the variable relevancies of one single GP 
run cannot be trusted fully as a specific variable might have a large relevance in 
a single run simply by chance. Thus, it is desirable to analyze variable relevance 
results over multiple GP runs in order to get statistically significant results. 



3 Experiments 

We applied the comprehensive symbolic regression approach, described in the 
previous sections, to identify macro-economic variable interactions. In the fol- 
lowing sections the macro-economic dataset and the experiment setup are de- 
scribed. 



3.1 Data Collection and Preparation 

The dataset contains monthly observations of 33 economic variables and indexes 
from the United States of America, Germany and the Euro zone in the time span 
from 01/1980 to 07/2007 (331 observations). The time series were downloaded
from various sources and aggregated into one large dataset without missing val- 
ues. 

Some of the time series in the dataset have a general rising trend and are 
thus also pairwise strongly correlated. The rising trend of these variables is not 
particularly interesting, so the derivatives (monthly changes) of the variables are 
studied instead of the absolute values. The derivative values (d(x) in Figure 1)
are calculated using the five-point formula for the numerical approximation of
the derivative [10] without prior smoothing.
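
For reference, the standard central five-point formula for the first derivative can be sketched as follows. The step size h = 1 (one month) and the handling of the series boundaries are assumptions of this sketch; the text above does not state how the first and last two observations were treated, so the sketch simply returns interior points only.

```python
def five_point_derivative(series, h=1.0):
    """Central five-point estimate of the first derivative.

    Uses (x[t-2] - 8*x[t-1] + 8*x[t+1] - x[t+2]) / (12*h) and returns
    values for interior points only, so the result is four elements
    shorter than the input (boundary handling is an assumption).
    """
    if len(series) < 5:
        raise ValueError("at least five observations are required")
    return [
        (series[t - 2] - 8.0 * series[t - 1] + 8.0 * series[t + 1] - series[t + 2])
        / (12.0 * h)
        for t in range(2, len(series) - 2)
    ]


# The formula is exact for polynomials up to degree four, e.g. x(t) = t^3.
cubic = [float(t ** 3) for t in range(7)]
slopes = five_point_derivative(cubic)  # derivative 3*t^2 at t = 2, 3, 4
```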



3.2 Experiment Configuration 



The goal of the modeling step is to identify the network of relevant variable
interactions in the macro-economic dataset. Thus, several symbolic regression 
runs were executed to produce approximation models for each variable as a 
function of the remaining 32 variables in the dataset. In this step symbolic 
regression models are generated for each of the 33 variables in separate GP 
runs. For each target variable 30 independent runs are executed to generate a 
set of different models for each variable. 

The same parameter settings were used for all runs. Only the target vari- 
able and the list of allowed input variables were adapted. The GP parameter 
settings for our experiments are specified in Table 1. We used a rather standard
GP configuration with tree-based solution encoding, tournament selection, sub-
tree swapping crossover, and two mutation operators. The fitness function is the
squared correlation coefficient of the model output and the actual values of the
target variable. Only the final model is linearly scaled to match the location and
scale of the target variable [S]. The function set includes arithmetic operators
(division is not protected) and additionally symbols for the arithmetic mean, the
logarithm function, the exponential function and the sine function. The terminal
set includes random constants and all 33 variables of the dataset except for the
target variable. Each variable can be either non-lagged or lagged up to 12 time
steps. All variables contained in the dataset are listed in Figures 1 and 2.
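
The fitness function and the final scaling step can be illustrated with a short sketch. The squared Pearson correlation follows the description above; the least-squares form of the linear scaling is an assumption of this sketch, as are the function names and the convention of returning zero fitness for a constant model output.

```python
def _moments(outputs, targets):
    """Return means, covariance sum and variance sums of two sequences."""
    n = len(targets)
    mean_o = sum(outputs) / n
    mean_t = sum(targets) / n
    cov = sum((o - mean_o) * (t - mean_t) for o, t in zip(outputs, targets))
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    return mean_o, mean_t, cov, var_o, var_t


def squared_correlation(outputs, targets):
    """Squared Pearson correlation of model output and target values."""
    _, _, cov, var_o, var_t = _moments(outputs, targets)
    if var_o == 0.0 or var_t == 0.0:
        return 0.0  # assumption: a constant series gets the worst fitness
    return cov * cov / (var_o * var_t)


def linear_scaling(outputs, targets):
    """Least-squares slope and intercept that map outputs onto targets."""
    mean_o, mean_t, cov, var_o, _ = _moments(outputs, targets)
    slope = cov / var_o if var_o != 0.0 else 0.0
    return slope, mean_t - slope * mean_o


targets = [1.0, 3.0, 5.0, 7.0]
raw = [0.0, 1.0, 2.0, 3.0]  # right shape, wrong location and scale
fitness = squared_correlation(raw, targets)
slope, intercept = linear_scaling(raw, targets)
scaled = [intercept + slope * o for o in raw]
```

Because the squared correlation is invariant to the location and scale of the model output, the raw output above already has maximal fitness; scaling is only needed once, to make the final model match the target values.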

Two recent adaptations of the algorithm are included to reduce bloat and 
overfitting. Dynamic depth limits [?] with an initial depth limit of seven are
used to reduce the amount of bloat. An internal validation set is used to reduce 
the chance of overfitting. Each solution candidate is evaluated on the training 
and on the validation set. Selection is based solely on the fitness on the training 
set; the fitness on the validation set is used as an indicator for overfitting. Models 
which have a high training fitness but low validation fitness are likely to be
overfit. Thus, Spearman's rank correlation $\rho(\mathrm{Fitness}_{\mathrm{train}}, \mathrm{Fitness}_{\mathrm{val}})$ of the
training and validation fitness of all solution candidates in the population is
calculated after each generation. If the correlation of training and validation
fitness in the population drops below a certain threshold the algorithm is stopped.
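
The stopping rule can be sketched as follows. The threshold of 0.2 is the value listed in Table 1; the function names, the use of average ranks for ties and the convention of returning a correlation of zero when all fitness values are identical are assumptions of this sketch.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rank correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def should_stop(train_fitness, validation_fitness, threshold=0.2):
    """Stop when training and validation fitness no longer agree in rank."""
    return spearman(train_fitness, validation_fitness) < threshold


train = [0.2, 0.4, 0.6, 0.8]
healthy = should_stop(train, [0.1, 0.3, 0.5, 0.9])  # rankings agree
overfit = should_stop(train, [0.7, 0.5, 0.3, 0.1])  # rankings are reversed
```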

The dataset has been split into two partitions; observations 1-300 are used
for training, observations 300-331 are used as a test set. Only observations
13-200 are used for fitness evaluation; the remaining observations of the training
set are used as an internal validation set for overfitting detection and for the
selection of the final (best on validation) model.

4 Results 

For each variable of the dataset 30 independent GP runs have been executed
using the open source software HeuristicLab. The result is a collection of 990
models, 30 symbolic regression models for each of the 33 variables generated in
990 GP runs. The collection of all models represents all identified (non-linear)
interactions between all variables. Figure 1 shows the box-plot of the squared
Pearson's correlation coefficient ($R^2$) of the model output and the original values
of the target variable on the test set for the 30 models for each variable.

Parameter            Value
Parent selection     Tournament (group size = 7)
Replacement          1-Elitism
Initialization       PTC2 [?]
Crossover            Sub-tree-swapping
Mutation             7% One-point, 7% sub-tree replacement
Tree constraints     Dynamic depth limit (initial limit = 7)
Model selection      Best on validation
Stopping criterion   $\rho(\mathrm{Fitness}_{\mathrm{train}}, \mathrm{Fitness}_{\mathrm{val}}) < 0.2$
Fitness function     $R^2$ (maximization)
Function set         +, -, *, /, avg, log, exp, sin
Terminal set         constants, variables, lagged variables (t-12) ... (t-1)

Table 1. Genetic programming parameters.

4.1 Variable Interaction Network 

In Figure 2 the three most relevant input variables for each target variable are
shown, where an arrow (a -> b) means that variable a is a relevant variable
for modeling variable b. In the interaction network variable a is connected to b
(a -> b) if a is among the top three most relevant input variables averaged over
all models for variable b, where the variable relevance is calculated using the
metric shown in Equation 1. The top three most important input variables are
determined for each of the 33 target variables in turn and GraphViz is used to
lay out the resulting network shown in Figure 2.
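
The construction of the network can be sketched as follows. The sketch assumes that a relevance dictionary is available for every run of every target variable; the function names, the alphabetical tie-breaking and the toy relevance values are illustrative assumptions and not results from the experiments. The output is plain DOT text that can be passed to GraphViz.

```python
def top_inputs(relevance_per_run, k=3):
    """Average the per-run relevance values and return the k top inputs."""
    totals = {}
    for run in relevance_per_run:
        for variable, value in run.items():
            totals[variable] = totals.get(variable, 0.0) + value
    n_runs = len(relevance_per_run)
    averaged = {v: s / n_runs for v, s in totals.items()}
    # Sort by descending relevance; ties are broken alphabetically.
    return sorted(averaged, key=lambda v: (-averaged[v], v))[:k]


def interaction_edges(relevance_by_target, k=3):
    """Create one edge (input -> target) per top-ranked input variable."""
    edges = []
    for target in sorted(relevance_by_target):
        for source in top_inputs(relevance_by_target[target], k):
            edges.append((source, target))
    return edges


def to_dot(edges):
    """Render the edge list as a directed graph in DOT syntax."""
    lines = ["digraph interactions {"]
    lines += [f'  "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


# Toy relevance values for two targets with two runs each (not real data).
relevance_by_target = {
    "a": [{"b": 0.6, "c": 0.4}, {"b": 0.7, "c": 0.3}],
    "b": [{"a": 0.8, "c": 0.2}, {"a": 0.5, "c": 0.5}],
}
edges = interaction_edges(relevance_by_target, k=1)
dot = to_dot(edges)
```

In the toy example a and b select each other as the most relevant input, which produces the kind of double-linked relation discussed below.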

The network of relevant variables shows many strong double-linked variable 
relations. GP discovered strongly related variables, for instance exports and im- 
ports of Germany, consumption and existing home sales, building permits and 
new home sales, Chicago PMI and non-farm payrolls and a few more. GP also 
discovered a chain of strongly related variables connecting the producer price
indexes of the euro zone, Germany and the US with the US CPI inflation.

A large strongly connected cluster that contains the variables unemployment,
capacity utilization, help wanted index, consumer confidence, U.Mich. expectations,
U.Mich. conditions, U.Mich. 1-year inflation, building permits, new home
sales, and manufacturing payrolls has also been identified by our approach.

Outside of the central cluster the variables national activity index, CPI in- 
flation, non-farm payrolls and leading indicators also have a large number of 
outgoing connections indicating that these variables play an important role for 
the approximation of many other variables. 




Fig. 2. Variable interaction network of macro-economic variables identified through
comprehensive symbolic regression and frequency-based variable relevance metrics.
This figure has been plotted using GraphViz.



4.2 Detailed Models 



The variable interaction network only provides a coarse-grained, high-level view
of the identified macro-economic interactions. To obtain a better understanding
of the identified macro-economic relations it is necessary to analyze single models 
in more detail. Because of space constraints we cannot give a full list of the best 
model identified for each variable in the data set. We selected two models for 
the Help wanted index and CPI inflation instead, which are discussed in more 
detail in the following sections. 

The help wanted index is calculated from the number of job advertisements
in major newspapers and is usually considered to be related to the unemployment
rate [2], [1]. The model for the help wanted index shown in Equation 3 has
an $R^2$ value of 0.82 on the test set. The model has been simplified manually and
constant factors are not shown to improve comprehensibility. The model includes
the manufacturing payrolls and the capacity utilization as relevant factors.
Interestingly, the unemployment rate, which was also available as an input variable,
is not used; instead, other indicators for economic conditions (Chicago PMI, U.
Mich cond.) are included in the model. The model also includes the
building permits and the wholesale price index of Germany.

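\[
\begin{aligned}
\text{Help wanted index} = {} & \text{Building permits} + \text{Mfg payroll} + \text{Mfg payroll}(t-5) \\
& + \text{Capacity utilization} + \text{Wholesale price index (GER)} \\
& + \text{Chicago PMI} + \text{U. Mich cond.}(t-3)
\end{aligned} \tag{3}
\]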

Figure 3 shows a line chart for the actual values of the help wanted index in
the US and the estimated values of the model (Equation 3) over the whole time
span covered by the dataset.

Fig. 3. Line chart of the actual value of the US Help wanted index and the estimated
values produced by the model (Equation 3). Test set starts at index 300.

The consumer price index measures the change in prices paid by consumers
for a certain market basket containing goods and services, and is a measure of the
inflation in an economy. The output of the model for the CPI inflation in the US
shown in Equation 4 is very accurate with a squared correlation coefficient of 0.93
on the test set. This model has also been simplified manually and again constant
factors are not shown to improve comprehensibility. The model approximates the
consumer price index based on the unemployment, car sales, new home sales,
and the consumer confidence.

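\[
\begin{aligned}
\text{CPI inflation} = {} & \text{Unemployment} + \text{Domestic car sales} + \text{New home sales} \\
& + \log\bigl(\text{New home sales}(t-4) + \text{New home sales}(t-2) \\
& \qquad + \text{Consumer conf.}(t-1) + \text{Unemployment}(t-5)\bigr)
\end{aligned} \tag{4}
\]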

Figure 4 shows a line chart for the actual values of the CPI inflation in the
US and the estimated values of the model (Equation 4) over the whole time span
covered by the dataset. Notably the drop of the CPI in the test set (starting at
index 300) is estimated correctly by the model.




5 Conclusion 

The application of the proposed approach to the macro-economic dataset
resulted in a high-level overview of macro-economic variable interactions. In the
experiments we used dynamic depth limits to counteract bloat and an internal 
validation set to detect overfitting using the correlation of training- and valida- 
tion fitness. Two models for the US Help wanted index and the US CPI inflation 
have been presented and discussed in detail. Both models are also rather accurate
on the test set and are relatively comprehensible.

We suggest using this approach for the exploration of variable interactions
in a dataset when approaching a complex modeling task. The visualization of
variable interaction networks can be used to give a quick overview of the most
relevant interactions in a dataset and can help to identify new unknown
interactions. The variable interaction network provides information that is not
apparent from analysis of single models, and thus supplements the information
gained from detailed analysis of single models.